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Abstract 

This paper presents four different ways of looking at the well-known Least Squares Temporal 
Differences (LSTD) algorithm for computing the value function of a Markov Reward Process, each 
of them leading to different insights: the operator-theory approach via the Galerkin method, the 
statistical approach via instrumental variables, the linear dynamical system view as well as the limit 
of the TD iteration. We also give a geometric view of the algorithm as an oblique projection. Fur- 
thermore, there is an extensive comparison of the optimization problem solved by LSTD as compared 
to Bellman Residual Minimization (BRM). Also, we look at the case where the matrix being inverted 
in the usual formulation of the algorithm is singular and show that taking the pseudo-inverse is the 
optimal thing to do then. We then review several schemes for the regularization of the LSTD solution. 
Moreover, we describe a failed attempt to derive an asymptotic estimate for the covariance of the 
computed value function, as well as describe one particular Bayesian scheme where such a covariance 
could be plugged in. We then proceed to treat the modification of LSTD for the case of episodic 
Markov Reward Processes. 

g 

1 Novel contributions of this paper 
^v^j , The main contribution of the present paper is the bringing together of the many perspectives that 

have been applied to thinking about LSTD. However, there is also novel material. In particular, we note 
the following: in section [BJ we give the first correct derivation of LSTD via instrumental variables. In 
this respect, the paper [BJ, although seminal and certainly right is its core message, suffers from several 
serious formal shortcomings such as entirely ignoring the fact that the Bellman operator may take the 
value function out of the class representable by linear function approximators - compare equations in 
section 5.2 of [BJ, in addition to being unnecessarily complicated in its technical aspects. In contrast, 
our derivation is barely a page long and addresses the formal issues well. In section [7J we provide, as 
compared to |19| . an additional way of decomposing LSTD into an oblique projection as well as the proof 
of a formal fact that makes the label 'oblique projection' right in the first place. In section [§1 we do 
not introduce any new formulae but we believe that a direct comparison of the optimization problems 
shown in the provided table is the first where all the given ways of looking at the problem are present 
in one place. In section [TU1 the convergence analysis of the matrix series, necessary to formally compete 
the derivation, is new. The section [T2] is new but we do accept the essence must have been informally 
\ realized for some time before. Our analysis of what happens when the matrix that LSTD inverts becomes 

singular, done in section[T4]is entirely new. In section [TBI our the idea to use the linear dynamical system 
view of LSTD for regularization is new in this context, although is has been strongly inspired by [BJ, 
■ which however solves a different problem. The attempted covariance derivation of section [TBJ is entirely 

new. The analysis of the episodic LSTD solution in terms of the feature matrix, the MRP transition 
matrix and the mean reward vector in section [T5] is new. Finally, a fact about the assumption behind 
LSTD proved in section [L9l is also entirely new. 
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2 Motivation 

The main practical problem that the LSTD algorithm solves is such: we are given a feed of data from 
a stochastic system, consisting of a state description in terms of features and of rewards. The task is 
to construct an abstraction that maps from states to values of states, where the value is defined as the 
discounted sum of future rewards. For example, the system may describe a chess game, the features of 
state may describe what pieces the players have while the reward signal corresponds to wither winning or 
losing the game. The value signal will then correspond to the value of having each particular piece. Note 
that this is not a general constant but may depend on the way the individual players play the game, for 
example the values may be different for humans than for computer players. 

Associated with our problem setting is the question whether the value function is interesting in its 
own right, or whether we only need it to adjust the future behaviour of some aspect of the environment we 
can control (i.e. in our chess example, make a move). We believe that there is large scope of systems (for 
instance expert systems) where the focus will be on gaining insight into the behaviour of the stochastic 
system, but the decisions about whether or how to act will still be made manually by human controllers, 
on the basis of the value-function information. These are the cases where algorithms like LSTD are the 
most directly applicable. On the other side of the spectrum, there will also of course be situations where 
the value function estimate is used as a tool to automatically generate the best action on the part of the 
agent - such systems may also use value-function estimation algorithms of the kind of LSTD if they are 
chosen to operate within the policy iteration framework. 
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3 Problem setting and notation 

We are concerned with the problem of finding the value function of a finite-state Markov Reward 
Process (MRP). We only have access to linear features of states and to the obtained rewards. More 
formally, denote as P the transition matrix of the MRP. For each state s we have a feature row-vector 
4> s . The feature design matrix gives the features of all states of the MRP, row- wise. We know, since 
we have an MRP, that a random reward obtained while leaving a given state only depends on that state. 
We introduce the vector td , the z-th element of which contains mean reward obtained while leaving the 
state i. We use £ to denote a left eigenvector of P with eigenvalue one. Note that if the chain has 
a stationary distribution, it will correspond to one such eigenvector, but we do not require it. Indeed 
we treat each recurrent class separately and give the current class the index c. We also introduce the 
matrix S c = diag(£). We now define expectations of functions of the Markov process in terms of weighted 
averages. For example the expectation of 4> T 4>, is defined by E c [</> T </>] = <f>JS c <f>£), and similarly for 
other functions. Here, we mean that for each recurrent class c there exists such S c that it is legitimate to 
consider the above quantity an expectation corresponding to long-time average by the standard ergodic 
theorem for Markov chains (applied to states within the recursive class c). Note that we still allow for 
aperiodicity, i.e. the diagonal of 5 C may not be a stationary distribution, but the expression still matches 
the long-time average. We use subscripts do denote two-step sampling, for example <p' s denotes the fact 
that we first sample a state, then the successor state and obtain the feature of that successor state. 
When we write an expectation w.r.t. such a variable, for example E c [r^] , the distribution we mean for r 
is P( r l s )£s- We note that the MRP is fixed, i.e. we only consider the on-policy setting. We use bold 

letters to denote random variables, for instance s denotes state and cf> denotes feature. Once we have 
obtained samples from our process, we store them in matrices $5 and r$, whose i-th rows correspond to, 
respectively, the state feature vector and reward obtained at time i. Observe the difference between $£> 
and $5 - in the first one, each state is represented once, in the second one the number of rows corresponds 
to the trajectory taken in the MRP and repetitions are possible. The value function is discounted with 
the factor < 7 < 1. Moreover, we introduce the square matrix D which has ones on the main diagonal 
and —7 on the diagonal above it. It is the sample based equivalent to the operator / — 7P. 

4 The approximation that LSTD makes 

Now we can write the Bellman equation, which defines the true value function: V (s) — E c [r + jV (s') \ s] 
E c [r I s] + -fE c [V (s') \s\. It can be rewritten in matrix form as Vd = {I — jP)~ 1 i*d- Here, the vector 
Vd contains the value of the function V (•) at each state. Now the scope for the LSTD algorithm is 
where it is not possible to use this equation to compute the value function exactly (i.e. compute the 
table-lookup solution) for two reasons: (1) we may not have access to the states directly, just to functions 
<j) s and (2) even if we did, this would require sample-based versions of the matrices P, and rjj, the first 
of which being particularly intractable because the number of entries is square in the (already large) 
number of states. Thus we exploit a linear architecture: i.e. we seek to approximate the true value func- 
tion V (•) with the function V (s) = (j)sW, which is linear in w; or, in the language of random variables 
V (s) = 4>w. We will briefly discuss two possibilities for how to choose an appropriate V () within the 
linear class of functions. The obvious thing to do would be to define Vd — nVo = H(I — r yP)~ 1 rD, where 



n = $£)( < 1'JS c $d) _1 $JS c where we note that the inverse is well-defined by assumption (A.l) This 
formula guarantees that the distance from Vd to Vd is minimal in the L 2 norm weighted by E c . The 
problem with this approach is that it is not known how to compute a useful estimate of the projected 
value function from samples 0. Therefore we need a different approximation. We call it V (•). It comes 
through the equation Vd = HTVd , where we look for the fixpoint of the operator TIT instead of the Bell- 
man operator T, where Tx = Td + "/P&dX- This is motivated further in the next section (see equation 
[3D. 

Now the question, of course, is about the relation between our approximation Vd and the projection 
of the true value function Vd , as we have defined it in the previous section. We now state without proof 
the relation between the two estimates developed in [25] (see their references for prior work). 

V d -Vd = (I-iUP)- 1 (Vd-V d ) (1) 

This can be used to obtain the following bound, which does not require us to estimate the matrices n or 
P (see [55] for proof and for sharper bounds): ||Vb — Vb||s c < (1 — "/ 2 )~ 1 ^ 2 \\Vd — Vd[[h c - We see from 
this that one example where the approximation of equation [6] is appropriate is when we have substantial 
discounting - in that case, if the linear framework is good at all, i.e. if the projection Vd = HVd is close 
to the true value function, then so will be our approximation. We note that this bound concerns only 
one recursive class, which, i.e. the one corresponding to n. It tells us nothing about the other classes 
(i.e. using this bound only, we have to accept the value function at the states belonging to them may be 
arbitrarily off the mark) . 



1 One algorithm that can do that in the limit of infinitely many samples is Least-Squares Monte-Carlo. It is, however, 
prone to high variance in the estimate for small sample sizes. 
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In the subsequent sections, we will describe various seemingly different approaches to computing 
V (•) from samples, which however all lead to the same formula for the solution. In order to derive our 
algorithm, we make two assumptions. First, we assume that the matrix E C [0 T 0] is full rank (A.l) This 



is often understood to mean that the features are linearly independent. This reflects a useful intuitive 
concept, but is not exactly true. We stress that the statement concerns both the features and the 
transition dynamics of the MRP, and means that the parts of the features corresponding to states visited 
with nonzero probability are independent. We note that this implies that the matrix E c [</> T (</> — 70')] 



is also full rank (F.l) - we discuss why this implication holds in section [T9j We will also use this to claim 



the invertability of ^jD^ without further comment (i.e. we assume we have enough samples). Also, 



we assume that the mean of the reward process exists E c [r s ] < oo (A. 2) 

To summarize the description, we restate the fundamental conditions for LSTD to yield good value 
estimates: (1) the linear architecture itself needs to match the problem and the set of features needs to be 
set right, that is Vd must be close to Vd, (2) the approximation Vd needs to be good, for example through 
discounting and finally (3) the sample based approximation w to w must also be good (in the following 
sections we define an consistent estimator for w, i.e. a way to compute w, so that the value function 
computed from a sample trajectory approaches Vd for the recursive states in the class corresponding to 
that trajectory as the length of the trajectory goes to infinity). 

5 Derivation using the Galerkin method 

That LSTD corresponds to a special case of the Galerkin argument has been implicitly realized for 
some time, and formally stated in [3J, on which this section is based. The general idea of the Galerkin 
method is to approximate the fixed point of T, Tx* = x*. We have x* — axgmin x \\Tx* — x\\ . We introduce 
the approximation by considering points from within the column space of $, so that our approximate 
fixpoint satisfies x* £ C(<!>), yielding x* = argmin^gg^) \\Tx* — x\\, which is equivalent to the following, 
after substituting $y* for x* and for x. 

$y* = argmin ||T$y* - $y\\ (2) 

v 

Now, for the L 2 weighted semi-norm with the corresponding projection operator LI, this has an analytic 
solution: $y* = IL(T($y*)). Now, in our case $ = $d, LI = $ £) (<I>JS C $ £ )) _1 $JS C where we note 



that the inverse is well-defined by assumption (A.l) the evaluation of the operator T at &w becomes 



td + jP$>dw with w assuming the role of y* and the norm || • || becomes the weighted norm || • ||s e . Now 
we solve the following. 

§dw = U(r D + ~/P$dw) (3) 

T$ D W 

This can be transformed in the following way. 

t&bII - 7 ($JS c $ I) )- 1 $5s c P$z 5 )) w =^ s .($JS c $ D )- 1 $5s c r 13 (4) 



In the above, we can cancel out the terms $u, because by assumption (A.l)| <&d has to be of full column 



rank. We then multiply both sides by (<&JS C $£>), to obtain (($J^ C $£>) - 7$Js c P$£)) w = $]}Z c rD, 
which leads to the following. 

w = ($5S C $Z5 - 7 <L JS C P$ D ) ~* <S>DZ c r D (5) 
This is the same as the expression we will obtain in the instrumental variable section. 

6 Derivation through instrumental variables 

We will first obtain a statistical model that expresses the properties of the approximation V (■). 
By solving the Bellman equation directly in the linear approximation regime, we obtain the following 
equation. 

4>w = V(s) = F lc [r\s}+"fE c lv(s') s] + e s = E c [r | s] + 7 E C [0' | s] w + e s (6) 

We note that in the above, we introduce the convention that w is a column vector while the features are 
row vectors. This convention minimizes the number of transposes we have to write. Note that we had 
to introduce the TD error vector e = [e Sl , • ■ ■ ,e Sn ] T = T$dw — &dw and the corresponding random 
variable e s (i.e. the error is a deterministic function of the current state, which is random), since the 
reward vector td and the expected feature vector E c [0' | s] w may not be in the feature space (i.e. the 
column space of $£)). It can be verified using the result for w from the previous section that, the error 



terms satisfy e^ c §D = 0, i.e. it is orthogonal to the feature space (F.2) (indeed it can be seen after a 
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brief manipulation that the condition eS c $£> = is equivalent to the formula [S]), and that consequently 
lie = 0. This is not a derivation from first principles, since we had to use an external argument to verify 
that eS c <I>£) = 0. But given the model of equation [5] it is nonetheless instructive to look at the mechanics 
of how the derivation works because this is the first one to have been proposed for LSTD. 

We now accept equation |5] as a given and give a statistical derivation as provided in the original 
LSTD paper [B], based on methods described in [23]. Now, because we do not observe the expectations 
E C [ 7 V (s') | s] and E c [r | s] in equation H2 but merely samples of an d <$>' we model the residue wrt. the 
expected value as noise, yielding the probabilistic model r s = E c [r \s] + r/ s , where we use assumption 



(A. 2) and <f>' 8 — E c [0' \s] + s s . Note that by definition E c [r7 s | s] — 0. Observe that this implies that 



(F.4) 



E c [»7s | 4>] — E c [E c [?7s | s] | 0] =0 (F.3) by the law of iterated expectation (LIE). Analogously, E c [e s | 0] = 
Thus we can rewrite equation H2 to obtain the following. 

4>w = r s + jcf>' a w - 7£ s u> - r] s + e s or r s = (0 - "f4>' s ) w + js s w + r] s - e s (7) 



Now, we cannot use traditional least-squares to solve this, since the expression £ s = je s w + r/ s may be, 
in general, correlated^ with ~ 70s! so will be the projection error term e and the two will not cancel 
in general. Therefore the noise term £ s — e s may be correlated with with — j<fi' s . Also, E c [e s | s] is 
not necessarily zero. But ordinary least squares (OLS) requires that noise be uncorrelated with input 
variables and that it have mean zero to yield consistent estimates. However, there is still a way to 
obtain a good estimate. More formally, we first need to establish the following properties. First, we have 
E C [0 T 77 S ] = E c [E c [0 T r/ s | 0]] = E c [<£ t E c [j7 s | </>]] = 0, where the first equality follows from LIE and 
the second from fact |(F.3"J| By the same reasoning, we have E c [<£ T e s ] = from fact (F.4) With these 



two properties, we can now multiply both sides of equation [7] by T , which we for this purpose call an 
instrumental variable, and then take expectation, so as to make the noise terms vanish. We also have 
E c [0 T e s ] = by fact |(F.2)| This results in the following. 

E c [0 T r s ] =E c [0 T (0- 7 ^)] ™ + 7 E c [0 T £s ] W + E c [0 T r7 s ] -E c [0 T e s ] (8) 

V v ' 

= o 

Now because we know from fact (F.l) that E C [</> T (0 — 70^)1 ^ s invertible, the estimator w is given by 



the following. 

W = E c [0 T (0- 7 0' s )] _1 E c [0 T r s ] = ($5s c (/-7P)$ D ) _1 $5s c rz, or w - (pjDQsy^rs (9) 

This finishes the formal derivation. We will now give two different intuitive interpretations to the 
instrumental variable method. First, consider the sample equivalent of equation[7J which we now rewrite 
in matrix notation rs — D§su> + Cs ~ es, where by Cs we denote the vector containing the noise terms 
for each individual sample and by eg the sample values of the random varaible e s . Now, as described 
above, we cannot solve it by OLS because of the correlation between the noise and D$s- So we 'fix' 
D$s by projecting it onto the feature space (i.e. the column space of since we know that noise is 
uncorrelated with features. We introduce the projection operator II5 = $g($J$5)~ 1 $J, where we note 



that the inverse exists by assumption (A.l) given we have enough samples. Now our equation becomes 
the following. 

n s r S = n s D<S> s w + UsCs -n s e s or ^^^sT^s^s = jS^y^sT^J^s* (10) 

— >0 as N— >oo 



In the above, we can cancel the terms because $s has, by assumption (A.l)| independent columns if we 



have enough samples. This leads to the same estimator that we derived above. This interpretation is 
known in econometric literature as two-stage least squares (2SLS), because we solve two linear systems: 
first we project Z?$s on the subspace of features and then we solve the resulting modified equation. In this 
context we stress that we would get the same solution if we only applied the projection on the right-hand 
side, e.g. rs — HsD&sw ~ this can be seen by noticing that the choice of w in this equation is unaffected 
by any component of rs orthogonal to the feature space. We also see the direct correspondence between 
this and the projection step in the derivation through Galerkin method - the equation [3] is essentially the 
limiting version of the sample-based equation 1101 



2 Indeed, we have E c [<£g T T7 s ] = 0, E c [</> T e s ] = and E c [</) t tj s 
E c [<£s T <£s] — E c [cp'J E c [(f>' 3 I s]l = t&jHc^x) — <£j,P T S c P<E>£), where the last term does not vanish in general. 



4 



Kamil Ciosek 



Properties of the LSTD algorithm 




TV 



(I-jP)V D 7C((/-7P)*d) 




C{$p)_ 



Figure 1: LSTD can be interpreted as an oblique projection (left) and as a fixpoint algorithm (right). 



7 Two kinds of oblique projection 

There is one more way to interpret the instrumental variable approach. Observe that the equation 
Ilsrs = TlsD&stb, can be rewritten as Tls(D(&sw — r$) = 0. Thus we have that applying the projection 
amounts to solving r$ — D<&sw under the constraint that the projection of the residual on the feature 
space is zero. Therefore LSTD yields the same solution as applying the oblique projection of the rewards 
on the difference between values of successive states (i.e. D$s), along the subspace orthogonal to the 
column space of <E>s (which is the left null-space of $5). See also figure [T] 

Recall the formula for the coefficients of the oblique projection on the columns space of X orthogonal 
to the column space of Y, which is (Y T X)~ 1 Y T . It is easy to verify that putting X = (I — jP)$>r> and 
Y = S c <f>c recovers the LSTD solution. Notice that in this case, the projected vector, X(Y T X)~ 1 Y T 
corresponds to obtaining the 'smoothed rewards' corresponding to the approximate value function (i.e. 
(I — jP)Vd, or what the rewards would have been if there had been no approximation of the value 
function). Now there is also a different way of defining the projection, namely we can project not the 
reward vector but the true value function Q2j]. In this case, setting X = $£> and Y = (I — r yP) T E c ^D 
again produces the LSTD solution w. Notice that in this case the projected vector corresponds to the 
approximate value function. 

Notice that formally speaking, in both the interpretation as a projection of the reward vector and 
the value function, we also need another condition to call LSTD an oblique projection - in order for the 
formula X(Y T X)~ 1 Y T to mean a projection on C(X) orthogonal to C(Y), we need the condition that 
the orthogonal complement of C(X) and C(Y) should be complementary subspaces. We will now claim 
that this is the case in either of the above ways of thinking about LSTD as a projection. To do this, we 
will prove the following statement, for any invertible matrices A,B, where we assume that A is invertible 
and B<&d is full column rank. We denote by k the number of columns in <5>£> (they are known to be 



linearly independent by assumption (A.l)) 



C{A<S> D )^®C{B>S> D ) 



-Bz.$T,A T B$ D z = 



First, we note that the dimension of C(B^o) is k since B<&d is full column rank and the dimension of 
C(A^£>)- L is exactly n — I since A is invertible. The argument in the left-to-right direction is as follows: 
if 3z.^J)A T B^ D z — 0, then there would be a vector, B<fr D z, which is both in C(B$ D ) and C(A^ D ) ± . 
Therefore these two subspaces cannot sum to the n-dimensional space if they share a common vector. 
This contradiction finishes the argument. The argument in the right-to-left direction is thus: there is no 
vector in both C(B^d) and C(A^d)- l , then because of their dimensions they have to sum to the whole 
space K™. 

We now see that the condition -^z.^J^A 1 BQ^z — is fulfilled in the case of LSTD because by 



assumption (A.l) the matrix $]jA T B&d, and hence also &J)B T A$e> has to be invertible. In this 



expression, we can substitute A = I and B = (/ — 7P) H c or alternatively A = I — 7P and B = H c to 
obtain either of the interpretations of LSTD as projection outlined above. We note that in either case, 



B<&d is full column rank by assumption (A.l) together with fact (F.l) and A is invertible since P is a 
Markov matrix. 



8 Decompositions of the loss 

We now present an interpretation of the minimization defined in equation [21 after [T] . We recall that 
the minimization in equation [5] can be rewritten in the following way $y* = argmin y ||T$y* — < I ) y||H c = 
n(T($y*)). Therefore $y*-II(T($y*)) = 0, or ||$y*-n(T($y*))||a c =0. Therefore LSTD can be seen 
as the following. 



y* = argmin||$y - n(T($y))|| Hc 

y 



(11) 



We note that this expression has no recursion and that the minimization is guaranteed to reach the 
optimum value of zero. We can now rewrite the norm as follows ||$y — IL(T(3>y))\\s c = ||$y — T(&y)\\s c — 
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|II(T($y)) — T($y)\\s c , where the equality follows from the Pythagorean theorem and the fact that 
$y — n(T($?/)) and II(T($y)) — T($y) are orthogonal vectors, with respect to the S c -weighted dot 
product, which corresponds to II. We thus obtain the following formula for the LSTD solution. 



y* = argmin \\&y — T($y)||i 

Bellman residual 



|n(T(%))-T($y)|| Hc 



(12) 



We see that the LSTD algorithm minimizes a quantity which is the Bellman residual minus the rcpro- 
jection error on the feature space. We discuss in section [9] the difference between simply minimizing the 
Bellman residual only and the LSTD algorithm. 

Another way to interpret the LSTD loss is to see it as a nested optimization problem |11| , which leads 
to the following two equivalent formulations. First, define the projection in the following way. 



h*{y) — argmin \\$h - T(3>y)\\ Sc 



(13) 



Then we plug this for the definition of II(T($y)) in equations [TTJ and [T^] respectively, giving the 
following equivalent equations. 

y* = argmin ||$y -$/i*(y)|| Ho or y* = argmin (||$y - T{$y)\\ Sc - \\$h*(y) - T($y)\\ Sc ) (14) 
y y 



9 The difference between LSTD and Bellman Residual Minimization 

Instead of constructing the oblique projection as described in the previous sections, we can use a 
simpler algorithm, known as the Bellman Residual Minimization, which corresponds directly to projecting 
the rewards on the differences between successive states (see figure [5]) - i.e. it is similar to LSTD except 
the projection is orthogonal, not oblique. BRM can be interpreted as the un-nested version of the 
optimization from the previous section. 



h* = argmin \\<&h - T($h) 

h 



(15) 



The reason LSTD was originally introduced as an improvement over BRM [6] is that for BRM, we do not 
have a justification in terms of a statistical model similar to the one we had in section [5]- the noise terms 
are correlated, so we cannot use a similar reasoning to claim consistency of BRM. But of course the fact 
that one line of deriving an algorithm doesn't work for BRM does not mean that the algorithm is wrong 
- there may be other justifications available. Interestingly, it can be shown that under our assumption 



(A.l) the two approaches are similar (the argument comes from chapter 4 of |2J). Indeed, we have from 
the previous section (compare equation [TT]) that LSTD is similar except for the presnce of the projection 
n. It is sometimes useful to have formulas that make the difference between the two algorithms explicit 
in different formulations of each algorithm. The algebraic relationships between the two algorithms are 
summarized in the table below. 



LSTD 



BRM 



mm, 
min, 
w = 
min, 



|nr$w - $w|| Hc 

\T$w-$wh - 



nr^w - T<f>w\\ Sc 
i-^p 



w 



\V-$ D w'\ 

iit<s>w 



(/-7P) T n T H c n(/-7P) 



min^, \\T$w - $w\\~ c 
mm w \\T$w - <&w\\b. 

w = ( ' ' _ 

mnw ||V - w D w H(/_ 7 P)T Hc(/ _- 



$dw'\\ 

>/r = $($ 1 L T Z c <i>)- L <P L 1 S c T<f>w 



oblique projection, see 1191 



There has been renewed interest in the analysis of the difference between the two algorithms. One 
argument [19J is that in an off-line setting (i.e. in the situation when the weighing coefficients are different 
from the stationary distribution of the MRP, a scenario we do not consider in this paper) a performance 
bound can be shown about BRM that is impossible to derive about LSTD [TS], on the other hand LSTD 
remains widely used in practice. 

There is yet one more feature that means that LSTD is preferable to BRM is some practical cases - 
while with LSTD, as we have shown above, we only need one sequence of samples of features of states 
and a sequence of samples of reward to obtain an estimate of the value function; but with BRM we need 
to have two samples of the features of states. 

We will now show a way to obtain a sample-based estimate wb of the BRM solution, based on section 
3.1 of [16J. We want to minimize the expectation E c [(4>wb — 4>' w b — r ) 2 ]- We have the sampled fea- 
tures <I>g and the sampled rewards rs- We also have a second set of sampled features <&|. The sampled 
features are produced using the following process: given the trajectory s\,82,..., the features in $5 
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Figure 2: BRM as projection of rewards (left) and minimizing the Bellman residual (right). Cmp. fig. Q] 



are 0(si), <fi{s2), • • • while the features in $| correspond to 'alternative' states s' 2 ,s' 3 ,... sampled from 
P(-\s\), P(-\s2), In other words, the features in $| describe where the MRP might also have gone 
to given a particular previous state. Of course, such sampling is only possible if we have a model of 
the transition dynamics of the MRP. Now, we can write a sample-based approximation to the expecta- 
tion given above as E = j±j Yh=\ (^\( i )' w b ~ 7$s(* + l ) w B ~ rsWX^sW^fl ~ j$ 2 s (i)w B ~ r s (i)), 
where the notation $5(2) means selecting row i of the matrix $5(1) (i.e. the z-th feature in the tra- 
jectory). We can now introduce the notation ^P 1 = &g(l : N - 1) - 7*5(2 : N) and *S? 2 = <J> S (1 : 
N — 1) — 7$|(1 : N), where the colon notation denotes ranges of rows. With this notation, we have 

that w^ 1 ^ 2 w B = wJ* 2T * 1 u) B = J^iLi 1 (®s(*) w b ~ 7*s(* + 1 )wB)($ 1 s( i ) w B - 7$|(«)«>b)- It can 
now be seen after a few rearrangements that E = (wgty lJ ^> 2 wb — rj^ 1 + ^ 2 )wb + rj rs^j — 

■j^j ^Wg(^ lT ^ 2 + ^i^^^wb - rji 1 ^ 1 + I 2 )m; b + rjr s y Taking the gradient with respect to w B 

leaves the us with the system (v]/ lT v]/ 2 + iJ/ 2T ty 1 )w B = (VP 1 + ^ 2 ) T rs, where we denoted by the 
sample-based BRM solution. 

10 Derivation using a linear dynamical system 

This section is based on [15]. We will begin by constructing a MRP which lives in the space of features 
instead of our original state space. We limit ourselves to the class of linear dynamical systems. We need 
to define the matrix F and the vector q, so that a transition from <fi to cj>' (row vectors) is modelled by 
4>F = <fi' , and the reward we expect at <p is modelled by <pq = r. Now we look for the values for F and q 
that model our system dynamics. We have that Q>dF should be approximately equal to P$d and 
to r. We weigh states by S c , giving the following optimization problems. 

F = argmin \\$ D F - P$ D \\s c = argmin trace (($ D F - P$ D ) T Z C ($ D F - P$ D )) 

F F 

q = argmin ||<&d<? - r D \\ Sc — argmin ($d<7 - r D ) T Z c ($ D q - r D ) (16) 
9 g 

These optimization problems correspond to ordinary least squares (generalized to matrices in case of F) 
and the solutions are obtained by projection: $>r>F = HP&d an d ^dQ = IlrD, where the projection 
matrix is defined as in section [5] (n = (f>£)(<f>JS c <I>£)) _1 <I>JS c ) and the matrix cancels with the one 
in the projection, since it is full column rank. Now consider a feature vector <fi. In the new MRP, we 
can compute the value function exactly (i.e. all the approximation has already taken place when we 
constructed the matrix F and vector q). The true value function associated with it is the expected 
discounted future reward: </> X^o(7^) l( 7- Note that the vector of values corresponding to states of the 
original MRP is given by ^DEtotT^)'?' This is obviously a linear combination of columns of <5>£> 
(features) - this is why we introduced the projection in the first place. Thus we have w = X^o(7-^T# = 
1 + 2i^i(7-f )*9 = 1 + iFw , where the last equality is the well known telescoping sum argument. We 
thus have the equation (/ — jF)w — q, which is exactly the same as equation U once we plug in the 
computed values of F and q. Thus we have obtained the same estimator. 

In the above, we assumed that the series X^o(7-^) i( 7 converges. We show a stronger condition, 
namely that the series J2iLo(lFY converges. This follows from the following reasoning. We introduce 
the notation II" = ($JS c $£i) _1 <I>JS c , so that n = <&diT. Now we have that F = H~P$d, <7 = II" ru and 
F k = U~ P(HP) k ~ 1 &£) for k > 1. Therefore to have convergence, it suffices to establish that 7ILP has 
spectral radius less than one. Consider first the case when we have S c > 0. Now, is is well known (see for 
example [1], proposition 6.3.1) that 7IIP is a contraction in the norm weighted by S c ; now, because we 
have S c > 0, this implies that the spectral radius condition holds. In the case where we have transient 
states, we use the following reasoning. We use the block matrix notation from section []j|J Now, because 
P nt in necessarily zero, it can be seen after a brief calculation that the matrix II<I>£> has the following 
block matrix form. 



np 





' 








where 11/ = <S> 1 D (<$>~} D ~J <S> f D y l <S>~] D ~J 



In the above, $/d contains only the columns corresponding to recurrent states and the block matrix 
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denoted with ■ ■ ■ is not important for our reasoning. Now, because we have zeros in the second block 
column it can be seen that all eigenvalues of 11$ u are either zero or they are also the eigenvalues of TlfPf. 
where our spectral radius result holds by the contraction argument as before. Thus we have the result 
for the general case. 

Note that in the above proof we used the fact that the distribution that S c has on the diagonal is 
a left eigenvector of P corresponding to eigenvalue one. If S c used for the projection were an arbitrary 
distribution, then the matrix F would in general have spectrum beyond the unit circle (see [5] for a 
concrete example). 

We also stress the interpretation of the matrices F and q in terms of expectations, as given below. 
F = (^lE c ^ D )-^l3 c P^ D = E c _1 E c [<j>l4>' a ] 

q = (^lE c ^ D )-^ T D E c r D = E c [0j0 8 ] _1 E c [0jr] 

We wish to note the applicability of the described definitions of F and q to the setting where we have 
a single transition matrix P, but instead of just one reward we have many tasks, each of which with a 
different reward |15] . We note that in this setting, while we still have to learn q for each task separately, 
it is worthwhile to learn F using training data from all tasks. 

We also want to stress the fact that we can construct sample-based variants of the matrices F and 
q (call them F and q respectively) and still obtain the same algorithm. Let us adopt the following 
definitions. 

F = argmin \\§ S F - N<S> S \\ = (^J^s) -1 ($J N$ s ) 

F 

q = argmin - r s || = (S^s)" 1 ^]^) 
In the above, we denote by N the matrix which has ones above the main diagonal and zeros elsewhere. 



We note that the required inverse exists by assumption (A.l) It can be easily verified that the estimate 



(I — 7.F) 1 q is consistent with what we gave in the section about instrumental variables 

11 Derivation from the iterative TD algorithm 

We have seen in section [5] that the equality E c [</> T e s ] = — $JS c rD + $JS C (7 — -fP)§rjw = is crucial 
for the development of the algorithm and indeed equivalent to the obtained estimator for w (equation [5]). 
We will now show another way of obtaining this equality - actually, it may be taken do be the definition 
of the algorithm, and used as a justification for the formula [3] that stands on its own. We now give the 
interpretation of this equation is in terms of the iterative TD algorithm [23J . We note that the equality 
= E c [</> T e s ] corresponds to saying that the LSTD solution corresponds to the fixpoint of iterative TD, 
i.e. the point where the expected update is zero. 

Consider now the definition of the iterative TD algorithm [S3] . We assume for the moment that we 
have an oracle for the value function and are interested in iteratively solving the optimization problem 
mm w (V (s) — V (s)) 2 using the approcimation architecture V (s) — 4> s w. The iterative update is given 
by V w (V (s) — V (s)) 2 = 2S7 W V (s) (V (s) — V (s)). We now have the following formula for the iteration. 

Aw cx y w V (8)( (r t+1 +'yV (s t+1 )) - V (s t )) 

(fr(s t ) T oracle for value 



TD error e„ 

Now we have that the update Aw at time t, is <fi(s t ) T e 3t . Setting the expectation of this update to 
zero gives the desired formula. We also note that the relation between the TD iteration and the LSTD 
algorithm resembles the chicken-and-egg problem - one can either, as we did above, consider the iteration 
a priori knowledge and use that to justify the LSTD fixpoint, or one can start with the fixpoint and treat 
the iteration as a way of reaching it, motivated by stochastic optimization. LSTD can also be extended 
to compute the fixpoints of TD(A) or, more generally other similar algorithms with different traces. For 
details, see [7] in slightly different notation. 

12 Interpretation in terms of minimizing a quadratic form 

This section is based on |22| . It interprets LSTD as the minimization of a quadratic form in the 
error between the true value function V {■) and the approximated value function $bio, We begin by 
reformulating the formula for the estimator obtained above. 

w = ($JS C (7 - jP^d)' 1 <S>lZ c r D = 

= ($5(/ - 7 p) T s c $ D ($Js c $ D )- 1 $5s c (/ - -fP^oy 1 $5(7 - 7 p) t s c $ I) ($Js c $ d )- 1 $Js c (/ - 1 P)V 
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This equality holds because m = {I — "fP)V an d because the matrices $>J)(I — jP) T ^ c &d and ^J)^ c ^>d 



are invertible by assumption (A.l) Now. we introduce the matrix K, as below. 

k = (i- 7 p) T s c $ D ($5s c $ D )- 1 $5s c (/ - 1 p) = (i- 7 p) T n T s c n(/ - 7 p) 

We note that S c $£i(<i>JS c $£i) _1 $JS c = S C I1 = I1 T S C = n T S c Il, where the last equality follows by sub- 
stituting the definition of II and canceling the inverted term. Therefore we have w = ($ Jif$ D ) <5> T D KV . 
But this is the solution to the well-known optimization problem: w = argmin^, \\V — &dw'\\k = 
argmin,^, (V — $flio') T K(V — $>dw'). Thus we gain an insight about approximation V (•) of equa- 
tion [5] - instead of minimizing the norm |j • |~ c , which would yield us Vd, we minimize the different 
norm || • \\k, thus gaining the ability of efficiently estimating the solution from samples. Note that we 
can also repeat the above reasoning, without the multiplication by ($JS c <i>£)) -1 , to obtain the ma- 
trix K' = (I — 7P) T S C $£>$JS C (/ — jP) which also defines a valid minimization - this is the way the 
equivalence was originally introduced in |20J. 

13 LSTD is a subspace algorithm 

In section we have shown that the algorithm can be thought of as an oblique projection along the 
subspace orthogonal to the feature space. Here, we make explicit the property that LSTD only depends 
on the features through the subspace they span i.e. any full-rank transformation (i.e. basis change) C of 
features does not influence the value function. To see this, consider the sample estimate we derived in 
earlier sections, where we use the transformed features <I>sC instead of 

V c = <f> s Cw c = $5C(C T $sLi$5C) _1 C T $Jr s = 

= $ s CC" 1 ($jLi$s) _1 C T_1 C T $Jrs = $ s (<f > l;D<5> s )- 1 <p];r s = V 
As a corollary, we state that LSTD is independent of any scaling of features. 

14 The case with the singular matrix 

We are now concerned with the case where the assumption (A.l)| is not met, i.e. E c [e/> T (</> — 70')] 



is not invertible. It then follows by fact (F.l) that the matrix E c [0 T </>] is also not invertible. We 



will now argue for that in such a case, taking the pseudo-inverse instead of the inverse of the matrix 
E C [</> T (0 — 70')] in the LSTD formula produces a good estimate of the value function. To do this, we 
will discuss two cases in which the matrix E c [</> T </>] may be singular. In writing this section we had to 
assume a conjecture, outlined in section [2T1 which we strongly suspect is true but have no proof for it. 

First, it may be the case that <&d has dependent columns. We also assume that S c > (we will deal 
with the case where this is not the case in the next section). We introduce the matrix C, which selects 
a basis from the columns of so that <1>£>C has independent columns and C is a skinny matrix with 
columns from the identity matrix. We note that the matrix C T< i>JS c <i>£)C is invertible. We now wish to 
prove that V c = ® D C(C T <S>l~ c {I - 1 P)<S> D C)- 1 C 1 ~$lE c r D = $ D ($JS C (J - 1 P)^ D )+^ c r D = V+. 
Now if we assume that our conjecture about oblique projection holds (see section we can see that 
this follows automatically since C($d) = C(<S> D C), C{{I - ~/P) T Z c <£ D ) = C((I - ~/P) T Z c <P D C). 

Having examined this situation, we now assume that the columns of $c are independent. Algebraically 
speaking it could happen that for some vector v ^ 0, we have $d» ^ but ||$£>i;||h c = 0, since the 
notation || • || ~ c only denotes a semi-norm, i.e. the vector $do is non-zero only in states to which S c assigns 
weight zero. We will now argue that this is impossible given how we construct our matrix. Consider 
the matrix E whose rows are rows of the identity matrix which eliminates the transient states, so that 
we have <E>|} = E&d, where $^ is the same as $£> except it only contains rows corresponding to the 
recurrent states. We similarly have = _BS C _B T . Now, we can see from the interpretation of the matrix 
$J2 C <I>£) as the expectation E c [0 T </>] that simply removing the states with zero weights (the transient 

states) does not change the value of the expression, so we have <I>JS c $£i = <f>|^ T Sf <i>^. But the second 
matrix is invertible. 

15 Regularization 

To overcome the problem of over-fitting, the standard procedure is to add a regularization term to 
the proposed algorithm. There are many ways of doing that. 

One way, proposed by |14] is to consider the optimization problem of the fixpoint equation [5] 
We can extend it as follows: $e>w = argmin^/ (||ro + jP^jjw — &dw'\\-e c + j3\\w'\\). Here, j3 > 
is an external parameter of the algorithm and both norm expressions are the usual Li norm, the 
first one weighted by 3 C . This way of regularizing produces the well-known analytic solution wr — 

c r D . In the paper [14J, a version is also given where the second norm is 
L\. In this case, because equation [2] is a fix-point equation, it is not possible to simply plug the problem 
into the standard LASSO algorithm, and a new algorithm is necessary (see [TJ for details). 
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Before we continue, denote the standard ^-regularized solution of a system of equations Ax = b as 
solveL 2 (A, b, f3) = argmin^ \\Ax - b\\s c + f3\\x\\ 2 = (A T E C A + (3I)~ 1 A T E c b. Denote the version with L\ 
regularization as solve^ (A, b, (3) = argmin^, \\Ax — 6||h c + (this has no explicit analytic form as has 

to be computed using an algorithm, typically LASSO). A second way of regularization, introduced in 
is to add regularization to equation 1141 giving the following optimization problem. 

y* = argmin ®h*{y)\\s c + ||y||ior2 

v 

In the above, the final norm may be either of Li or L\. A quick calculation shows that this is the same 
as regularizing the system of equations 01 This idea therefore corresponds to the solutions solve($c(^ — 
7($JS c $ £) )~ :L $JS c P<l5 £ ))), $n($JS c $_D)~ 1< i ) JS c rD,/3) for each of the discussed norms. 

Another way is adding regularization directly to the equation where we have already solved for w, 
that is, w — A~ x b, where A — ($ JS C (7 — jP)$>d) and b = $JS c rrj. If we regularize with L^. this 
corresponds to the solutions so\veL 2 (A,b, (3). This (together with other versions, that do not map to 
LSTD), has been done in [2], where the author also derives finite-sample error bounds. 

It is also possible to combine some of the above ways together, after the manner of [13], and to use 
other sparsifiers in place of L±. In |12j . for instance, the Dantzig selector is employed, which leads to a 
considerable simplification of the optimization problem (the optimization reduces to a linear program). 

A yet different approach [T7] to regularization is to keep the algorithm itself unchanged and instead 
do feature selection beforehand. Even if the feature selection algorithm is very simple (greedy based 
on correlation with residual), simulations [17J suggest that doing feature selection leads to performance 
essentially the same as the approaches described above. Because greedy feature selection is so simple, 
this suggests that regularization of LSTD is not yet really a fully solved problem. 

We now describe a different way of regularization. The idea is to consider the linear-system interpre- 
tation of LSTD and to regularize the linear system. A simple way of doing this, proposed in [5], is to 
add an additional rank constraint (over the matrix F) to the minimization of equation 1161 to obtain the 
following. 

argmin \\®dF - P$d\\s c subject to rank(F) < k (17) 

F 

This regularized F does not have a closed form any longer, but can still be obtained efficiently using SVD, 
in the following way. In the notation of section 120.21 we have that X — $rj and Y — P$d- Denote as U, 
A, V the singular value decomposition of the matrix Z C 1/2 $ D (Z C 1/2 $ D ) + Z C 1/2 P$ D = S C 1/2 ILPS C $ D . It 
can be shown (see [5] and 120. ll that the best rank-/c approximation F is given by the formula FVSkV T , 
where F is the full rank solution, the diagonal matrix Sk has ones in the first k diagonal entries and zeros 
elsewhere. Observe that this is equivalent to leaving just the first k columns of the matrix V (call the 
resulting sub- matrix 14) and using the formula FVkVjJ . This formula has an interesting interpretation 
- it can be seen as a product of two rectangular matrices (FVk)(V k T ), the first one skinny and the other 
fat. The process of approximating the next feature using this matrix can be thought to consist of two 
stages: compressing the feature vector into a shorter one (of length k) and then decompressing it again 
to full size. Notice that the compression / decompression operators are not defined uniquely, even if all 

i -i /Q 

the singular values of the matrix S c ' LIPS C $£) are distinct - we can insert any basis change matrix B 
and still obtain a valid decomposition (FVkB)(B^ 1 V k T ). In short, this regularization procedure doesn't 
really compute latent (or compressed) features, but rather it computes the latent subspace. 

Here, we have given the design (or limiting) variant of the regularized algorithm, i.e. the one that 
uses the formal specification of the MRP. It is of course possible o give a sample-based version. This can 
be done as follows. We start with the Slowing problem. 

F = argmin \\&sF - N$s\\ subject to rank(F) < k 
P 

We can then apply the mechanics described in section [203] We note that the rank constraint only affects 
the construction of F, not q. 

The idea described in the previous paragraphs is not to be confused with simply performing the PCA 
in the feature space before invoking the algorithm, which amounts to first computing the SVD of the 
matrix Sc 1 / 2 *^ = UAV T , to obtain the matrix V4 which contains the first k columns of V and then 
constructing the compressed features as $DVkV k T . This process, however, does not take into account the 
dynamics of the MDP - we see that none of the quantities involved depend in any way on the matrix P. 

A property of all the above regularizers is that we lose the invariance of the algorithm w.r.t. the 
choice of basis for the feature space, which can be seen as a natural characteristic of lste|. It is not 
clear whether the property would be worth preserving in a regularized version - sparsity by its very 
nature is not invariant to transformations of features, even linear ones and there is a general tendency 
that a more specialized algorithm will have less generic properties. 

3 Indeed section 4 of |13 | deals with how to perform standardization of features before plugging them into optimization. 
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16 An attempt to derive the asymptotic covariance of the computed estimate 

This section is based on the standard method for deriving covariances in the instrumental variable 
setting, as described in [23] . Unfortunately, this method does not lead to a usable covariance estimate in 
our case. It is nonetheless instructive to look at how the derivation would work and where it breaks. We 
assume S c > 0, which is also the condition for consistency. In the preceding sections, we have obtained the 
estimator for w as E c [0 T (0 — 70 s )] E c [0 T r s ] . The sample-based equivalent w is (<&J D^s) 1 ^ r S — 
(<I>JZ3<I>s) _1 <l>5 + Cs - es) = w + D® s )~ 1 ®J(Cs - es)- Now, we are interested in estimating 

the covariance structure of w. We need to make the additional assumption that E c [r^l < oo, i.e. variance 



of the reward process exists (A. 3) Note that this implies that E c [£|] < oo. Now consider the distribution 
of the term iV _1 / 2 $J (Cs — eg). We cannot apply the central limit theorem (CLT) directly, because by 
the dynamics of the MRP, the elements in the sum $JCs are n °t independent samples. Therefore we 
now rearrange the terms to obtain the following. 

N-^siCs ~ e s ) = N-i/i £f =1 MCs, - e 8i ) = N~W Eli <P* (E i:ai=s C*) - ^ E^Ii ^ = 

= N-^ Etr cl /2 ct> s (V, : . . C,) cS 112 ~ N-^ E f =i 0ils 
* , ' " . ' 

goes to N(0, E c [Cl] ) goes to JV(0, E c ^0 T 0ej] ) 

In the above, we applied the CLT to the sum of £ Si in each state separately, since, in a given state, they 
are independent. The terms c s stand for the expected frequencies of visiting each state, c s = N^ s . So 
we have that the distribution of N^ x ^ 2 ^JCs converges to a Gaussian with mean zero and covariance 
Ef = i</>J</> s E c [Cs] CsN- 1 -> E c [Cs0 T 0] (note that this is the same formula we would have got if we 
had wrongly assumed samples in first sum were i.i.d. and applied the CLT). In the second term, we can 
apply the CLT because the sum is equivalent to drawing from a multinomial distribution corresponding 
to the states of the MRP, where the probabilities are given by the stationary distribution and the mean 
E c [0e a ] is zero because of the property we discussed at the beginning of section [5] 

Unfortunately, the argument breaks because we cannot claim that iV -1 / 2 $J(£s — es) converges to a 
Gaussian with mean and covariance B given by B — > E c [(£ 2 + e 2 i )0 T 0] - we cannot assume anything 
about the dependence of the Gaussians that come from our two applications of the CLT. On the other 
hand, if we had been able to obtain the desired covariance B, we could assume (heuristically, but as 
is customary in asymptotic covariance derivations) that N~ x / 2 (w — w) is normally distributed with 
covariance E c [0 T (0 — 70s)] B E c [(0 — 70' S ) T 0] . This means we could (in the asymptotic limit) 
treat w as normally distributed with the following covariance. 

N E c [0 T (0 - 7 0^)] ^ B E c [(0 - 70i) T 0] _1 

The expectation term is easy to estimate from samples (and we need to compute it anyway to get the 
estimate of the value function). The problem is in the estimation of B. If we had access to an estimate 
B, this would give rise to the following estimate for the covariance of w. 

£ = ($] S D$ S )- 1 B($] S D T $ S )- 1 

Given the design matrix $a, we could then compute the covariance of the value function as 
In practice, we are unlikely to want to compute this using the whole design matrix. Instead, we will pick 
a set of interesting states and use a matrix whose rows correspond to them. Unfortunately, we don't 
know how to compute an expression for B, let alone an expression for B, estimable from samples. 

17 Bayesian interpretation of covariance 

We now follow the analysis of [TU] and define one possible probabilistic model for LSTD. We start 
with equation ($J_D<I>s)i() = Qjrs- Consider an arbitrary symmetric matrix G. We can multiply both 
sides with ($JZ?$5) T G to obtain the following. 

(^D^s) 1 "G($jD$ s )w = ($JC$ s ) T G$Jr s 

The w which solves this equation is the one which minimizes the following quadratic form. 

min || rg — Z?$s«)| 

w 

This is the same as the Gaussian maximum likelihood problem where P(rs \w) is proportional to exp { — ^ ■ } 
of the above, i.e. with the precision matrix ^sG^J. Now we can either solve this directly or we may 
optionally introduce a prior over w (assume a Gaussian prior with mean zero and precision L), to obtain 
the following. 

P(r s \w) cx exp j-^l! ?, s - £>$s^||<t, sG <i>J + IHU 
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Now, if we take as our matrix G the hypothetical inverse covariance from the previous section, we 
obtain the interpretation [21] that the variance of our values is the mean projected Bellman error. This is 
appealing because it models our intuition that the projected Bellman error reflects our uncertainty about 
state values. But it should be noted that this correspondence is essentially defined and not derived in 
the preceding argument. It is not a deeper fact about the LSTD algorithm - indeed we could plug any 
other covariance matrix as G and obtain a similar result. 

18 The episodic version of LSTD 

In the other sections of this paper, we have considered the case where the MRP never terminates 
and convergence is defined by taking the limit with respect to the length of a trajectory. We are now 
interested in extending our observations to the case where there is a termination state. The limit will 
now be the with respect to the number of episodes being accumulated. First, let us note that the formula 
w = E c [<fi T (<fi — 70' s )] E c [</> T r s ] is still valid in this case. We simply have to give new meaning to the 
expectation terms. 

We will now start by giving a design-based variant for the algorithm. All transitions in a terminating 
MRP can be described using a rectangular matrix P t , where the last column is meant to denote termi- 
nation. We assume in the following that the starting state of the MRP is the first state. We also assume 
that the matrix P t is such that the MRP will always eventually terminate. We first need to construct 
a state distribution 5. To do this, we append the row [1, ... 0] to the matrix P t , producing the square 
matrix P a , which assumes that the MRP restarts after reaching the termination state. Now, the diagonal 
entries of the matrix 5 are the entries of the left eigenvector of P a which corresponds to eigenvalue one. 
Now we also construct another square matrix, P, which we obtain by appending the row [0 ... 0, 1] to 
the matrix P t . This matrix assumes that the agent stays in the termination state forever. The intuition 
behind this is the following: the matrix P describes the true dynamics of the MRP, but in order to have 
a meaningful state distribution we need to take into account the fact that we have multiple episodes - 
hence the definition of the matrix P a , which models restart. Having defined the above matrices, we may 
use the standard formula in the following way: w = ($JS(I — r yP)^o) ^JSr^. Here, we assume 
that the last feature vector (i.e. the one corresponding to the state modelling termination) is zero. By 
definition, the final element of ru is also zero. 

It can be seen that the sample-based variant is the same as in the case of one long trajectory, except 
for the additional summation over the episodes. We note we use here the fact that the termination state 
has the feature of zero (so that we can still use the matrix D - there is no subtraction in the last row, 
but it doesn't matter since the last state is the terminal state). The formula looks as follows, where the 
sum goes over episodes. 



19 A fact about assumption (A.l) 



We prove that (A.l) implies (F.l)| We rewrite them in matrix form: det($JS c $£)) ^ and 
det(<i>JS c (/ — 7P)<f>£>) 7^ 0. We will now develop the second expression. By the well-known eigen- 
value argument, / — 7P is invertible. Assume for the moment S c > (we will deal with the case when 
this is not true later). Consider some non-zero vector x. We have that $JS C (J — jP)$rjx = if and 
only if the vector y = which in the column space of <!>d satisfies the condition that S C (J — "fP)y 

is orthogonal to the column space of <!>£>. This implies that y T S c (7 — r yP)y = 0. This holds if and 
only if y T (i(S c (7 — 7P)) + ^(S C (J — r yP)) T )y — 0. Now because the matrix defining this quadratic 
form is symmetric, and thus diagonalizable and with real eigenvalues, we have that this can only be 
zero if some of the eigenvalues are nonpositive. We will show that this cannot be the case. Rewrite 
the matrix ±(S C (7 - 7P)) + |(H C (J - 7P)) T as E C (I - 7±(P + S C _1 P T S C )). Now because by definition 
E c = diag(£) where £ T P = £ T , we have that S c ~ P T S C [1, . . . , 1] T = [1, . . . , 1] T ; moreover, S C _1 P T S C 
has positive entries. So it is a Markov matrix. Thus |(P + E c P T H C ) also is a Markov matrix. Thus, 
(I — 7^(P + S C _1 P T S C )) has eigenvalues in the positive real half-plane. But this matrix is obviously 
similar to E C (J — -f^(P + E C ~ 1 P T E C )) by the transformation Sc" 1 / 2 , so it has the same eigenvalues. This 
finishes the proof for S c > 0. Now consider the case when we do not have this, i.e. some of the diagonal 
entries of S c are zero. Intuitively, the fact we prove is now obvious since transient states do not influence 
the values of the expectations, except when the features of non-transient states are linearly dependent, 



which would violate (A.l) More formally, we can, without loss of generality assume that the states for 
which the probability given by the stationary distribution is zero have highest indexes (i.e. they occur at 
the back of matrices 3 C , P and <&£>). We introduce the following notations for block minors of matrices 
H c , P and the vector y corresponding to the non-transient and transient states. 





r - f 

1 — <c 


" 











p = 



' Pf 


Pnt 


Ptn 


Ptt _ 



Vf 

yt 
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Note that in the above, P nt is has to be the zero matrix - it corresponds to transitions from non-transient 
states to transient states. Therefore we have that S c (7 — jP)y = implies {I — "fPf)yf and thus, by 
the reasoning for the case without transient states, yj has to be the zero vector, which violates assumption 



(A.l) Thus S G (J — jP)y is nonzero, for all nonzero vectors y in the column space of <&£>. Therefore for 
WJjE c (I — jP)$dx to be zero, we need the vector H C (J — r yP)$DX to be orthogonal to the column space 
of In particular, this implies y T S c (7 — "fP)y — 0. Again, this simplifies to yjsj (I — jPf)yf = 0. 
Again, by the reasoning for the case without transient states, this implies that yj is the zero vector, which 



violates assumption (A.l) Thus we have the desired result. 



20 Summary of results about reduced-rank regression 

In the following sections, we summarize the results from literature concerning the algorithm for 
obtaining a solutions for reduced rank regression problems. 

20.1 Reduced Rank Regression as as two-stage procedure 

We first give the two-stage argument as outlined in [9 . Consider the optimization problem argmin F || XF- 
Y\\. The full-rank solution is Ff — X + Y . which corresponds to the approximation Yf — XFf — XX + Y. 
Consider now the SVD of Yf = UAV T , where we assume that the singular values are decreasing. A 
reduced-rank approximation to Yf of rank s can be obtained by stetting the smallest singular values 
to zero, leaving only s singular values, which yields the matrix Y s — UAI S V T , where we denote by I s 
the matrix with ones in the first s diagonal entries and zeros elsewhere. Of all the matrices with the 
required rank, this is the matrix closest to Yf in terms of the Frobenius norm. We will now construct 
a matrix F s , which leads to this approximation, i.e. Y s = XF S . Define F s = FfVI s V T . We have 
XF S = XF f VI s V T = Y f VI s V T = UAV T VI S V T = UAI S V T = Y s as required. This can be interpreted 
as a two-stage process because we first find the full-rank approximation and then perform the SVD. 

20.2 Weighted reduced rank regression 

We are now concerned with the optimization problem as given below. 

argmin F \\XF - Y\\ s = 

argmin F trace ({XF - Y) T E(XF - Y)) = 

argmin F trace ((E^XF - ~}/ 2 Y) T {~}l 2 XF - ~}' 2 Y)) 

We see from the above that we can solve the weighted regression problem argmin F \\XF — F||a by 
computing the solution of argmin F \\E l ^ 2 XF — S 1 / 2 !^. 

21 A conjecture concerning the oblique projection matrix 

To show our result concerning the pseudo-inverse, we will need a property of the oblique projection 
matrix that is probably true, but which we have been unable to prove. The property is as follows. Define 
the matrices P = X{Y T X)^Y T and P c = XC{Y T XC)^Y T . Assume C(X) = C{XC)). The desired 
property is that P = Pc- In the case where the matrix Pc is regarded as an oblique projection matrix, 
the interpretation is that the result of a projection only depends on the subspace along which we project 
and the subspace orthogonal to which we project, not on the choice of basis. Notice also that the property 
implies, if we also assume C(Y) = C{YC), that P = XC(C T Y T AC)+C T y T , by applying the original 
property to Pq . 
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